Explaining predictions of vote shares

Here, we build random forest models of vote shares in the Swedish general election of 2014. The input data consist of (i) vote percentages for the 290 municipalities in Sweden, (ii) a dataset with various indicators for these municipalities, such as unemployment rate and degree of urbanization, and (iii) the number of asylum seekers per capita for each municipality in 2014.

In this example, we demonstrate LIME's tabular mode, which applies when you have "normal" tabular data rather than, say, text or image data. We will show LIME's tabular explainer in both regression and classification mode.

In [14]:
import pandas as pd
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics  # used explicitly below; not imported by `import sklearn` alone
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import numpy as np
import lime
import lime.lime_tabular
%matplotlib inline
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode()
import mpld3 # If plotly is down
In [15]:
vote_data = pd.read_csv("municipality_votes_2014.tsv", delimiter="\t", index_col=0)
columns_to_use = ["PROCENT_M","PROCENT_C","PROCENT_FP","PROCENT_KD","PROCENT_S","PROCENT_V","PROCENT_MP","PROCENT_SD","PROCENT_FI"]
vote_perc = vote_data[columns_to_use].copy()  # .copy() avoids SettingWithCopyWarning when renaming columns below
vote_perc.columns = [x.split('_')[1] for x in vote_perc.columns ]

After some light pre-processing, we get a matrix with the vote share for the nine largest parties per municipality.

In [16]:
vote_perc.head()
Out[16]:
M C FP KD S V MP SD FI
ID
780 24.1 7.7 4.3 5.0 31.3 5.1 7.0 12.3 2.3
1487 17.8 6.6 4.8 4.5 35.1 6.6 5.9 16.0 1.9
1765 19.1 12.6 5.7 8.0 29.6 3.2 2.3 18.2 0.7
1293 18.9 6.1 3.8 5.2 30.0 3.1 6.8 23.7 1.7
2403 21.4 12.2 5.6 6.1 38.1 4.3 1.7 9.4 0.7

Now we join this table with the municipality descriptors and asylum numbers.

In [17]:
municipality_desc = pd.read_excel("kommundata.xlsx")
asylum_seekers = pd.read_excel("asylsdochm.xlsx")
asylum_seekers = asylum_seekers.iloc[2:, [1, 4]]  # drop header rows, keep name and asylum-seeker columns
asylum_seekers.columns = ["Municipality", "AsylumSeekers"]
asylum_seekers.AsylumSeekers = asylum_seekers.AsylumSeekers.astype(float)
muni = municipality_desc.merge(asylum_seekers, left_on="name", right_on="Municipality")
muni.head()
muni.index = muni.name

Great, now we have a table containing the information we need. For convenience, we also build two lookup tables that map municipality name to municipality code, and vice versa.

In [18]:
code_from_name = {}
name_from_code = {}
for row in range(muni.shape[0]):
    code_from_name[muni.iloc[row,1]]=muni.iloc[row,0]
    name_from_code[muni.iloc[row,0]]=muni.iloc[row,1]
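As a side note, the same lookup tables can be built in one line each with `dict(zip(...))` instead of an explicit row loop. A small self-contained sketch with hypothetical codes and names:

```python
import pandas as pd

# Toy stand-in for the `muni` frame (hypothetical values)
muni = pd.DataFrame({"code": [1862, 1292], "name": ["Degerfors", "Ängelholm"]})

# One pass per direction instead of iterating over rows
code_from_name = dict(zip(muni["name"], muni["code"]))
name_from_code = dict(zip(muni["code"], muni["name"]))
```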

Next, we drop some of the features that are either redundant or that we don't think are interesting.

In [19]:
features_to_drop = ['code', 'name', 'youthUnemployment2010', 'unemployment2010', 'satisfactionInfluence','satisfactionGeneral', 'satisfactionElderlyCare', 'Municipality']
muni = muni.drop(features_to_drop, axis=1)
muni.head()
Out[19]:
medianIncome youthUnemployment2013 unemployment2013 unemploymentChange reportedCrime populationChange hasEducation asylumCosts urbanDegree foreignBorn ... refugees rentalApartments governing fokusRanking foretagsklimatRanking cars motorcycles tractors snowmobiles AsylumSeekers
name
Växjö 201793 8.7 8.0 0.1 831 0.8 27.5 333 49.1 11.7 ... 33.8 234.1 Borgerligt 19 24 3651.694532 200.542250 260.370538 4.609128 4.95
Vänersborg 195999 13.2 9.4 -0.1 641 2.7 35.0 0 67.4 6.7 ... 14.4 164.8 Blandat 80 231 2230.455552 138.180123 226.960270 8.155657 19.78
Årjäng 182826 5.5 5.3 -1.9 575 -3.5 22.8 1960 72.8 8.8 ... 11.2 129.1 Borgerligt 204 113 765.670911 46.767875 183.031342 11.263467 13.36
Hässleholm 237817 13.1 9.6 0.5 751 3.9 30.7 749 88.9 15.5 ... 13.2 136.9 Borgerligt 83 187 1311.598558 103.215144 181.189904 0.951522 12.48
Bjurholm 168796 11.5 10.2 3.5 205 -3.2 25.1 546 39.3 11.2 ... 52.0 87.6 Borgerligt 266 38 565.270936 38.177340 176.518883 186.781609 4.93

5 rows × 28 columns

As our final pre-processing steps, we convert categorical variables to dummy variables and create two binary target variables: whether SD got more than 15% of the votes, and whether C got more than 10%, in each municipality.

In [20]:
X = pd.get_dummies(muni)
sd_over_15 = vote_perc.SD > 15.0
c_over_10 = vote_perc.C > 10
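To see what `pd.get_dummies` does to a categorical column such as `governing`, here is a minimal sketch (category values taken from the table above):

```python
import pandas as pd

# A tiny frame mimicking the categorical `governing` column
df = pd.DataFrame({"governing": ["Borgerligt", "Blandat", "Borgerligt"]})

# Each category level becomes its own 0/1 indicator column
dummies = pd.get_dummies(df)
print(sorted(dummies.columns))  # ['governing_Blandat', 'governing_Borgerligt']
```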

Regression case

Let's try to model the vote share of MP, for instance.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, vote_perc.MP, test_size=0.33, random_state=42)
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)
Out[21]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

How are we doing on the test set?
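Besides the scatter plot, a single number such as R² or the mean absolute error is useful. A self-contained sketch on synthetic data (with the real model you would pass `y_test` and `rf.predict(X_test)` instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and vote_perc.MP
rng = np.random.RandomState(0)
X_syn = rng.rand(200, 5)
y_syn = 10 * X_syn[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)
rf_syn = RandomForestRegressor(max_depth=2, random_state=0)
rf_syn.fit(X_tr, y_tr)
pred = rf_syn.predict(X_te)

print("R2:", round(r2_score(y_te, pred), 2))
print("MAE:", round(mean_absolute_error(y_te, pred), 2))
```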

In [23]:
trace = go.Scatter(
    x = rf.predict(X_test),
    y = y_test,
    mode = 'markers',
    text = X_test.index
)

layout = go.Layout(
    title='MP vote share, actual vs predicted',
    xaxis=dict(
        title='Predicted'
    ),
    yaxis=dict(
        title='Actual'
    )
)

data = [trace]

# Plot and embed in the notebook (offline mode, since init_notebook_mode() was
# called above; plotly.plotly.iplot would require online credentials)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Out[23]:
In [11]:
# Backup cell
from matplotlib import pyplot as plt 

fig, ax = plt.subplots(subplot_kw=dict(facecolor='#EEEEEE'))  # 'axisbg' was deprecated in matplotlib 2.0
N = len(X_test)

scatter = ax.scatter(rf.predict(X_test),
                 y_test) #,
                 #c=np.random.random(size=N),
                 #s=1000 * np.random.random(size=N),
                 #alpha=0.3,
                 #cmap=plt.cm.jet)
ax.grid(color='white', linestyle='solid')
ax.set_title("Actual vs. predicted", size=20)
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=list(X_test.index.values))
mpld3.plugins.connect(fig, tooltip)
mpld3.display()
Out[11]:

Let's try to explain a prediction from above! Note that the input needs to be a 2-d NumPy array, not e.g. a pandas DataFrame.
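For instance, a DataFrame can be converted with `np.array` (or `.values`), after which single rows are valid instances to explain:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = np.array(df)   # 2-d NumPy array, shape (2, 2)
row = arr[0, :]      # one instance, shape (2,)
print(arr.shape, row.shape)
```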

In [10]:
ix = np.where(X_test.index == 'Täby')[0].tolist()[0]
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train), mode='regression', feature_names=X.columns.tolist(), training_labels=y_train)
exp = explainer.explain_instance(np.array(X_test)[ix,:], rf.predict, num_features=3)  # top_labels applies only to classification
exp.show_in_notebook(show_table=True, show_all=True)

Binary classifier decision

We can try to build a classifier that predicts whether some party will get at least a certain percentage, e.g. does SD get more than 15% or C more than 10%?

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, sd_over_15, test_size=0.33, random_state=42)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
sklearn.metrics.confusion_matrix(y_pred=rf.predict(X_test), y_true=y_test)
#len(np.where(y_test==True)[0]) / len(y_test)
Out[12]:
array([[46,  3],
       [ 9, 38]])
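From the confusion matrix above (rows are actual, columns are predicted, in scikit-learn's label order False, True), precision and recall for the positive class follow directly:

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, columns = predicted
cm = np.array([[46, 3],
               [9, 38]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # 38 / 41
recall = tp / (tp + fn)          # 38 / 47
accuracy = (tp + tn) / cm.sum()  # 84 / 96
print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```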

Let's list the misclassified municipalities. We are especially interested in the false positives, where the model thought SD would get >15% but it didn't. Why did it think so?

In [13]:
df = pd.DataFrame({'predicted': rf.predict(X_test),
             'actual': y_test,
             'town': X_test.index})
df[(df['actual'] != df['predicted'])]
Out[13]:
actual predicted town
ID
1495 True False Skara
1292 True False Ängelholm
1862 False True Degerfors
765 True False Älmhult
1231 True False Burlöv
2081 True False Borlänge
1492 True False Åmål
643 True False Habo
1497 False True Hjo
1715 False True Kil
2361 True False Härjedalen
2062 True False Mora

Degerfors, for example, is one of the false positives. What share did SD actually get there?

In [13]:
vote_perc.loc[1862]
Out[13]:
M     10.6
C      4.1
FP     1.9
KD     3.4
S     51.4
V      8.0
MP     4.0
SD    14.4
FI     1.5
Name: 1862, dtype: float64
In [14]:
ix = np.where(X_test.index == 'Degerfors')[0].tolist()[0]
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train), feature_names=X.columns.tolist(), training_labels=y_train)  # labels must match X_train, not the full series
exp = explainer.explain_instance(np.array(X_test)[ix,:], rf.predict_proba, num_features=3, top_labels=1)
exp.show_in_notebook(show_table=True, show_all=False)

Multi-class classification

We could also try to predict which party gets the most votes in each municipality. First, let's check which parties actually received the most votes in at least one municipality.

In [15]:
largest = vote_perc.idxmax(axis=1)
largest.value_counts()
Out[15]:
S     248
M      40
SD      2
dtype: int64
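The classes are heavily imbalanced: S wins in 248 of 290 municipalities, so a trivial classifier that always predicts S already scores about 85%. Any model should be judged against that baseline:

```python
# Class counts from the cell above
counts = {"S": 248, "M": 40, "SD": 2}
total = sum(counts.values())             # 290 municipalities
baseline = max(counts.values()) / total  # always predict the majority class, S
print(round(baseline, 3))                # 0.855
```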
In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, largest, test_size=0.33, random_state=42)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
sklearn.metrics.confusion_matrix(y_pred=rf.predict(X_test), y_true=y_test)
Out[16]:
array([[13,  5],
       [ 6, 72]])
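Note that the matrix above is 2×2 even though there are three classes; with only two SD-won municipalities in total, SD presumably never appears in this test split. Passing `labels=` makes all three classes explicit even when one is absent from a fold. A sketch with toy labels (hypothetical, standing in for `y_test` and `rf.predict(X_test)`):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true/predicted labels
y_true = ["S", "S", "M", "S", "SD"]
y_pred = ["S", "M", "M", "S", "S"]

# A fixed label order yields a 3x3 matrix even if a class is missing
cm = confusion_matrix(y_true, y_pred, labels=["M", "S", "SD"])
print(cm)
```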

We could look at, for example, where the random forest predicted S as the largest party but M actually won.

In [17]:
df = pd.DataFrame({'predicted': rf.predict(X_test),
             'actual': y_test,
             'town': X_test.index})
df[(df['actual'] == 'M') & (df['predicted'] == 'S')]
Out[17]:
actual predicted town
ID
1292 M S Ängelholm
486 M S Strängnäs
488 M S Trosa
1427 M S Sotenäs
1419 M S Tjörn
In [18]:
ix = np.where(X_test.index == 'Ängelholm')[0].tolist()[0]
In [19]:
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train), feature_names=X.columns.tolist(), training_labels=y_train, class_names=['M','S','SD'])  # labels must match X_train, not the full series
exp = explainer.explain_instance(np.array(X_test)[ix,:], rf.predict_proba, num_features=3, top_labels=2)
exp.show_in_notebook(show_table=True, show_all=False)

Or the converse: municipalities where the model predicted M, but S was actually the largest party.

In [20]:
df[(df['actual'] == 'S') & (df['predicted'] == 'M')]
Out[20]:
actual predicted town
ID
136 S M Haninge
1231 S M Burlöv
643 S M Habo
1814 S M Lekeberg
1482 S M Kungälv
139 S M Upplands-Bro
In [21]:
ix = np.where(X_test.index == 'Kungälv')[0].tolist()[0]
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train), feature_names=X.columns.tolist(), training_labels=y_train, class_names=['M','S','SD'])  # labels must match X_train, not the full series
exp = explainer.explain_instance(np.array(X_test)[ix,:], rf.predict_proba, num_features=3, top_labels=2)
exp.show_in_notebook(show_table=True, show_all=True)